Homework 5¶

Maria Bochenek

Task 1¶

Consider the following model:

$$f(x_1, x_2) = (x_1 + x_2)^2$$

Assume that $X_1, X_2 \sim \mathit{U}[-1,1]$ and $X_1 = X_2$ (full dependency).

Calculate $PD$ (Partial Dependence) profile for variable $x_1$ in this model.

Extra task if you do not fear conditional expected values: Calculate $ME$ (Marginal Effects) and $ALE$ (Accumulated Local Effects) profiles for variable $x_1$ in this model.

Let's note that the marginal probability distributions of $X_1$ and $X_2$ are given by $$\begin{align*} h_{X_1} (x_1) &= \begin{cases} \frac{1}{2}, & \text{for } -1 \leq x_1 \leq 1 \\ 0, & \text{otherwise} \end{cases}\\ h_{X_2} (x_2) &= \begin{cases} \frac{1}{2}, & \text{for } -1 \leq x_2 \leq 1 \\ 0, & \text{otherwise} \end{cases} \end{align*}$$

The joint probability distribution of $X_1$ and $X_2$ must satisfy

$$h_{X_1, X_2} (x_1, x_2) = h_{X_1 | X_2} (x_1|x_2) h_{X_2}(x_2) = h_{X_2 | X_1} (x_2|x_1) h_{X_1}(x_1).$$

Moreover, under full dependency ($X_1 = X_2$) the conditional distribution of $X_2$ given $X_1 = z$ is a point mass at $z$: $$h_{X_2 | X_1=z} (x_2|x_1=z) = \delta(x_2 - z),$$ so $\mathbb{E}_{X_2 | X_1=z}[x_2] = z$ and $\mathbb{E}_{X_2 | X_1=z}[x_2^2] = z^2$.

Partial dependence¶

$$\begin{align*} g_1^{PD}(z) &= \mathbb{E}_{X_2}\left[z^2 + 2zx_2 + x_2^2\right] \\ &= \mathbb{E}[z^2] + \mathbb{E}[2zx_2] + \mathbb{E}[x_2^2] \\ &= z^2 + 2z \mathbb{E}[x_2] + \mathbb{E}[x_2^2] \\ &= z^2 + 2z \int_{-\infty}^{\infty} x_2 h_{X_2}(x_2)\,dx_2 + \int_{-\infty}^{\infty} x_2^2 h_{X_2}(x_2)\,dx_2 \\ &= z^2 + 2z \int_{-1}^{1} \frac{x_2}{2}\,dx_2 + \int_{-1}^{1} \frac{x_2^2}{2}\,dx_2 \\ &= z^2 + z \left[\frac{x_2^2}{2}\right]_{-1}^{1} + \frac{1}{2} \left[\frac{x_2^3}{3}\right]_{-1}^{1} \\ &= z^2 + z \left[\frac{1}{2} - \frac{1}{2}\right] + \frac{1}{2}\left[\frac{1}{3} + \frac{1}{3}\right] \\ &= z^2 + \frac{1}{3} \end{align*}$$
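As a quick sanity check of the closed form $g_1^{PD}(z) = z^2 + \frac{1}{3}$, we can estimate the PD profile by Monte Carlo, averaging the model over samples of $X_2 \sim U[-1,1]$ (a numeric sketch, not part of the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.uniform(-1, 1, size=1_000_000)  # samples of X_2 ~ U[-1, 1]

def f(x1, x2):
    return (x1 + x2) ** 2

for z in [-1.0, 0.0, 0.5, 1.0]:
    pd_mc = f(z, x2).mean()       # Monte Carlo estimate of E_{X_2}[f(z, X_2)]
    pd_exact = z ** 2 + 1 / 3     # derived closed form
    print(f"z={z:+.1f}  MC={pd_mc:.4f}  exact={pd_exact:.4f}")
```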

Marginal Effects¶

Since full dependency gives $X_2 = z$ almost surely when $X_1 = z$, the conditional moments are $\mathbb{E}_{X_2 | X_1=z}[x_2] = z$ and $\mathbb{E}_{X_2 | X_1=z}[x_2^2] = z^2$, so $$\begin{align*} g_1^{ME}(z) &= \mathbb{E}_{X_2 | X_1=z} \left[z^2 + 2zx_2 + x_2^2\right] \\ &= z^2 + 2z \mathbb{E}_{X_2 | X_1=z}[x_2] + \mathbb{E}_{X_2 | X_1=z}[x_2^2] \\ &= z^2 + 2z \cdot z + z^2 \\ &= 4z^2 \end{align*}$$

Accumulated local effects¶

$$\begin{align*} g_1^{AL}(z) &= \int_{-1}^z \mathbb{E}_{X_2 | X_1=v}\left[\frac{\partial(x_1 + x_2)^2}{\partial x_1}\bigg|_{x_1 = v}\right]dv \\ &= \int_{-1}^z \mathbb{E}_{X_2 | X_1=v}\left[2(v+x_2)\right]dv \\ &= \int_{-1}^z 4v\, dv \\ &= 4 \left[\frac{v^2}{2}\right]_{-1}^{z}\\ &= 4 \left(\frac{z^2}{2} - \frac{1}{2}\right) = 2(z^2 - 1) \end{align*}$$
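These two results can be checked numerically (a sketch under the full-dependency assumption $X_2 = X_1$): the ME profile reduces to evaluating the model on the diagonal, and the ALE profile is the integral of the expected local derivative $4v$ from $-1$ to $z$:

```python
import numpy as np

def f(x1, x2):
    return (x1 + x2) ** 2

zs = np.linspace(-1, 1, 201)

# ME: under full dependency, E_{X2|X1=z}[f(z, X2)] = f(z, z) = 4 z^2
me = f(zs, zs)
assert np.allclose(me, 4 * zs ** 2)

# ALE: accumulate the expected local derivative 2*(v + v) = 4v from -1 to z
# (trapezoid rule is exact for a linear integrand)
deriv = 4 * zs
ale = np.concatenate([[0.0], np.cumsum((deriv[1:] + deriv[:-1]) / 2 * np.diff(zs))])
assert np.allclose(ale, 2 * (zs ** 2 - 1))
```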

Task 2¶

Dataset¶

In this assignment we again use the QSAR-biodegradation dataset LINK for a binary classification task. The dataset has 17 features and 1055 rows. We changed the class labels from $\{1, 2\}$ to $\{0, 1\}$ for consistency with the procedure in HW1 (this was necessary to use BCELoss while training the Neural Network Classifier). Class 0 has $699$ instances and class 1 has $356$, so the imbalance ratio is $1.96$. The task is to distinguish readily (class 1) from not readily (class 0) biodegradable molecules based on molecular structure descriptors.

Models¶

We selected sklearn.ensemble.GradientBoostingClassifier, one of the better-performing models in the analysis from HW1. To extend the scope of the analysis we also included sklearn.linear_model.LogisticRegression as an example of a model with worse performance, for comparison.

We set a random seed for reproducibility and split the dataset into training and test sets (80:20). We performed feature scaling with sklearn.preprocessing.StandardScaler, fitted on the training set only. Model performance is presented in the table below.

|                            | recall   | precision | f1       | accuracy | auc     |
|----------------------------|----------|-----------|----------|----------|---------|
| GradientBoostingClassifier | 0.898305 | 0.688312  | 0.779412 | 0.85782  | 0.92568 |
| LogisticRegression         | 0.762712 | 0.714286  | 0.737705 | 0.848341 | 0.899309 |

Selecting observations for analysis¶

We selected 2 observations from the test dataset, one from each class. The selected observations are presented below.

|     | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|-----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|
| 261 | 0.570528 | -1.02671 | -0.199308 | 2.14755 | 0.090398 | 0.734119 | 0.567574 | -0.658983 | 0.114451 | -0.294956 | 0.0880715 | -0.00262827 | 0.152555 | 0.565992 | -0.49876 | -0.447733 | -0.311315 | 1 |
| 679 | 1.2897 | -0.922888 | 0.170087 | 2.2257 | -0.573715 | -0.644963 | 0.699339 | -1.06074 | -0.113295 | 0.506875 | 1.38873 | -0.0335233 | -0.751856 | -0.876388 | -0.267852 | -1.32678 | -0.50768 | 0 |

The predictions for the selected observations are presented below.

|                            | 261 | 679 |
|----------------------------|-----|-----|
| TARGET                     | 1   | 0   |
| GradientBoostingClassifier | 1   | 1   |
| LogisticRegression         | 0   | 0   |

It's important to note that even though the selected observations come from 2 different classes, both models assign them to the same class: GradientBoostingClassifier predicts that both observations belong to class 1, while LogisticRegression predicts that both belong to class 0.

What-if explanations of predictions using Ceteris Paribus profiles¶

The previously selected observations have different CP profiles, which is best illustrated by the variables $V1, V12, V15, V27$.

cp_261_vs_679.png

For most variables, the difference between the model's predictions for the two selected observations is stable, as is the case for $V12 < 1.7$. However, for $V12 > 1.7$, even though predictions increase for both observations, the rate of increase for observation $679$ is greater than for $261$; hence for $V12 > 2.2$ the prediction for $679$ exceeds that for $261$, even though we would expect $679$ to have the lower prediction since it belongs to class 0.

Another example of unstable behaviour that distinguishes the CP profiles of the selected observations is that for observation $261$ the model prediction decreases rapidly once $V1$ exceeds 0.3, while for observation $679$ it increases. Additionally, for $0.7 < V15 < 0.9$ the prediction for $261$ decreases while the prediction for $679$ increases; for $V15 > 0.9$ the predictions are stable again, but the absolute difference between them is larger than it was for $V15 < 0.7$. Similarly, for $V27$ the predictions for both observations are quite stable when $V27 < 0.6$ or $V27 > 1.5$; however, for $0.6 < V27 < 0.7$ the prediction for $261$ drops rapidly (later stabilising near the level the prediction for $679$ used to have), while the prediction for $679$ increases and later stabilises near the value the prediction for $261$ used to have. This behaviour might explain why these two observations are misclassified.

Interestingly, $V1$ and $V12$ were identified as variables of high importance with positive attributions for observation $879$ belonging to class 1, while $V1$ and $V27$ were of high importance with negative attributions for observation $1002$ belonging to class 0.

CP vs PDP¶

pdp_important_vs_not.png

Above we can see an example of a PDP profile that indicates high importance of the variable (left, $V18$) and a PDP profile that suggests low importance (right, $V17$).

CP is a local explanation, while PDP is a global one. Let's first explore the PDP for $V1, V12, V15$ and $V27$, the variables with the most different CP profiles for observations $261$ and $679$.

pdp_V1_V12_V15_V27.png

The fact that predictions are stable for $V15$ in the PDP profile contrasts with the previously analysed CP profiles, which could suggest that $V15$ is an important predictive variable. This is a good example of how CP profiles might locally suggest that the effect of a variable on predictions is significant even when the PDP profile for that variable does not agree.

PDP profile analysis shows that $V1$ and $V12$ have a bigger impact on predictions than $V15$. Based on the PDP profiles, the impact of $V27$ appears smaller than that of $V1$ and $V12$ but larger than that of $V15$. Moreover, the $V1$ and $V12$ values span a wider range than those of $V15$ and $V27$.

Model comparison¶

For variables $V18$ and $V17$, the level of importance indicated by the PDP profiles is the same for both models ($V18$ high, $V17$ low), as can be seen below.

pdp_LR_GB_important_vs_not.png

When comparing the PDP profiles between the models, the first thing worth noting is that LogisticRegression yields smooth variable profiles, while GradientBoostingClassifier's profiles are step functions with some variability.

For $V18$ the largest difference between the models occurs when $V18 < 0$, while for $V17$ the difference grows with the variable value for $V17 > 0$. This is expected, since ensemble tree classifiers usually shrink predictions towards the average and are often poor at extrapolating outside the range of values observed in the training dataset.
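This extrapolation behaviour is easy to reproduce on toy data (a sketch unrelated to the QSAR dataset): fit a gradient boosting model and a linear model on 1-D data and predict outside the training range; the tree ensemble's prediction goes flat while the linear model keeps extrapolating.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))  # training inputs confined to [-1, 1]
y = 2 * X.ravel() + rng.normal(scale=0.1, size=300)

gb = GradientBoostingRegressor(random_state=0).fit(X, y)
lr = LinearRegression().fit(X, y)

X_out = np.array([[2.0], [5.0]])  # far outside the training range
gb_out = gb.predict(X_out)
lr_out = lr.predict(X_out)

# All x > max(training x) fall into the same leaves, so tree predictions are flat:
assert np.isclose(gb_out[0], gb_out[1])
# The linear model keeps extrapolating along the fitted slope:
assert lr_out[1] > lr_out[0]
```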

pdp_LR_GB_V1_V12_V15_V27.png

The GradientBoostingClassifier model seems to underestimate the effect of $V1$ and $V15$ on the prediction. Both models explain the effect of $V27$ similarly, which might explain why it was previously identified as important for explaining predictions of GradientBoostingClassifier (HW3, HW4). The LogisticRegression model does not capture the more complicated relationship between $V12$ and the prediction. In theory, addressing those issues might improve predictive performance, in which case the previously selected observations $261$ and $679$ could be correctly classified.

Appendix¶

In [1]:
import dalex as dx
import lime
import shap

import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import random

import warnings
warnings.filterwarnings("ignore")

import plotly
plotly.offline.init_notebook_mode()

0. For the selected data set, train at least one tree-based ensemble model, e.g. random forest, gbdt, xgboost.¶

Load dataset¶

In [2]:
url = "https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/qsar-biodeg.csv"
df = pd.read_csv(url,index_col=0)
# change target labels 2->1 and 1->0
df["TARGET"] -= 1
display(df.head())
X = df.loc[:, df.columns != 'TARGET']
y = df["TARGET"]
|   | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|---|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|
| 0 | 3.919 | 2.6909 | 31.4 | 0.000 | 3.106 | 2.550 | 9.002 | 0.960 | 1.142 | 1.201 | 1.932 | 0.011 | 0.000 | 4.489 | 2.949 | 1.591 | 7.253 | 1 |
| 1 | 4.170 | 2.1144 | 30.8 | 0.000 | 2.461 | 1.393 | 8.723 | 0.989 | 1.144 | 1.104 | 2.214 | -0.204 | 0.000 | 1.542 | 3.315 | 1.967 | 7.257 | 1 |
| 2 | 3.932 | 3.2512 | 26.7 | 0.000 | 3.279 | 2.585 | 9.110 | 1.009 | 1.152 | 1.092 | 1.942 | -0.008 | 0.000 | 4.891 | 3.076 | 2.417 | 7.601 | 1 |
| 3 | 3.000 | 2.7098 | 20.0 | 0.000 | 2.100 | 0.918 | 6.594 | 1.108 | 1.167 | 1.024 | 1.414 | 1.073 | 8.361 | 1.333 | 3.046 | 5.000 | 6.690 | 1 |
| 4 | 4.236 | 3.3944 | 29.4 | -0.271 | 3.449 | 2.753 | 9.528 | 1.004 | 1.147 | 1.137 | 1.985 | -0.002 | 10.348 | 5.588 | 3.351 | 2.405 | 8.003 | 1 |

Train-test split¶

In [3]:
# Splitting data
SEED = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
scaler = StandardScaler().fit(X_train)
# scaling (fitted on train, applied to both splits)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# convert back to dataframes
X_train = pd.DataFrame(X_train, columns = X.columns, index = y_train.index)
X_test = pd.DataFrame(X_test, columns = X.columns, index = y_test.index)

Training¶

In [4]:
models = {
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=SEED, n_estimators=100),
    'LogisticRegression': LogisticRegression(random_state=SEED, penalty='l2', solver = 'lbfgs'),
    }

explainers = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    explainer = dx.Explainer(model, X_test, y_test)
    # save explainer
    explainers[name] = explainer
Preparation of a new explainer is initiated

  -> data              : 211 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 211 values
  -> model_class       : sklearn.ensemble._gb.GradientBoostingClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x0000021749659CA0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00988, mean = 0.358, max = 0.983
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.926, mean = -0.0782, max = 0.934
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 211 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 211 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x0000021749659CA0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 9.77e-11, mean = 0.348, max = 0.98
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.968, mean = -0.0685, max = 0.994
  -> model_info        : package sklearn

A new explainer has been created!
In [5]:
performance = pd.concat([explainer.model_performance().result for explainer in explainers.values()], axis=0)
performance
Out[5]:
|                            | recall   | precision | f1       | accuracy | auc      |
|----------------------------|----------|-----------|----------|----------|----------|
| GradientBoostingClassifier | 0.898305 | 0.688312  | 0.779412 | 0.857820 | 0.925680 |
| LogisticRegression         | 0.762712 | 0.714286  | 0.737705 | 0.848341 | 0.899309 |

1. Calculate the predictions for some selected observations.¶

Selecting observations¶

In [6]:
np.random.seed(SEED)
# y_test_0 = y_test[y_test == 0]
# y_test_1 = y_test[y_test == 1]

# idx_0 = np.random.choice(y_test_0.index, 1)
# idx_1 = np.random.choice(y_test_1.index, 1)
# idx = np.concatenate((idx_1, idx_0))
idx = np.array([261, 679])

obs_var = X_test.loc[idx]
obs_target = y_test[idx].to_frame()
obs_df = obs_var.join(obs_target)
display(obs_df)
|     | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
|-----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|
| 261 | 0.570528 | -1.026712 | -0.199308 | 2.147546 | 0.090398 | 0.734119 | 0.567574 | -0.658983 | 0.114451 | -0.294956 | 0.088071 | -0.002628 | 0.152555 | 0.565992 | -0.498760 | -0.447733 | -0.311315 | 1 |
| 679 | 1.289697 | -0.922888 | 0.170087 | 2.225700 | -0.573715 | -0.644963 | 0.699339 | -1.060742 | -0.113295 | 0.506875 | 1.388732 | -0.033523 | -0.751856 | -0.876388 | -0.267852 | -1.326784 | -0.507680 | 0 |

Predictions¶

In [8]:
pred_df = obs_target.copy()
for name, model in models.items():
    pred_df[name] = model.predict(obs_var)
display(pred_df.T)
|                            | 261 | 679 |
|----------------------------|-----|-----|
| TARGET                     | 1   | 0   |
| GradientBoostingClassifier | 1   | 1   |
| LogisticRegression         | 0   | 0   |

2. Then, calculate the what-if explanations of these predictions using Ceteris Paribus profiles (also called What-if plots)¶

e.g. in Python: AIX360, Alibi, dalex, PDPbox; in R: pdp, DALEX, ALEplot. *implement CP yourself for a potential bonus point
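For the bonus point, a CP profile can also be hand-rolled: copy the observation along a grid of values for one variable and predict at each grid point. A minimal sketch on synthetic data (the function name `ceteris_paribus` is our own, not a dalex API):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def ceteris_paribus(model, obs, variable, grid):
    """Predicted probability of class 1 as `variable` sweeps over `grid`,
    with all other features of the single-row frame `obs` held fixed."""
    rows = pd.concat([obs] * len(grid), ignore_index=True)
    rows[variable] = grid
    return model.predict_proba(rows)[:, 1]

# toy demo
Xs, ys = make_classification(n_samples=200, n_features=4, random_state=0)
Xs = pd.DataFrame(Xs, columns=["V1", "V2", "V3", "V4"])
clf = LogisticRegression().fit(Xs, ys)

grid = np.linspace(Xs["V1"].min(), Xs["V1"].max(), 50)
profile = ceteris_paribus(clf, Xs.iloc[[0]], "V1", grid)
```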

In [10]:
explainer = explainers['GradientBoostingClassifier']
np.random.seed(SEED)
# cp = [explainer.predict_profile(new_observation=obs_var.loc[[i]]) for i in obs_var.index]
cp_all = explainer.predict_profile(new_observation=obs_var)
Calculating ceteris paribus: 100%|████████████████████████████████████████████████████| 17/17 [00:00<00:00, 366.50it/s]
In [12]:
cp_all.plot(variables = ["V1", "V12", "V15", "V27"], color='_ids_')

3. Find two observations in the data set, such that they have different CP profiles.¶

For example, model predictions are increasing with age for one observation and decreasing with age for another one. NOTE that you will need to have a model with interactions to observe such differences.

The observations $261, 679$ satisfy this condition. See above.

4. Compare CP, which is a local explanation, with PDP, which is a global explanation.¶

*implement PDP yourself for a potential bonus point
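A PDP is just the average of CP profiles over the data, so it can be hand-rolled in a few lines. A minimal sketch on synthetic data (the function name `partial_dependence` here is our own helper, not the dalex or sklearn API):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def partial_dependence(model, X, variable, grid):
    """Mean predicted probability of class 1 over all rows of X,
    with `variable` clamped to each value in `grid`."""
    pdp_values = []
    for z in grid:
        Xz = X.copy()
        Xz[variable] = z  # clamp the variable for every observation
        pdp_values.append(model.predict_proba(Xz)[:, 1].mean())
    return np.array(pdp_values)

# toy demo
Xs, ys = make_classification(n_samples=200, n_features=4, random_state=0)
Xs = pd.DataFrame(Xs, columns=["V1", "V2", "V3", "V4"])
clf = GradientBoostingClassifier(random_state=0).fit(Xs, ys)

grid = np.linspace(Xs["V1"].min(), Xs["V1"].max(), 20)
pdp_manual = partial_dependence(clf, Xs, "V1", grid)
```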

In [14]:
explainer = explainers['GradientBoostingClassifier']
pdp = explainer.model_profile() 
Calculating ceteris paribus: 100%|█████████████████████████████████████████████████████| 17/17 [00:00<00:00, 38.48it/s]
In [26]:
# pdp.plot(variables = ["V1", "V12", "V15", "V27"], geom="profiles", title="Partial Dependence Plot with individual profiles")
pdp.plot(variables = ["V1", "V12", "V15", "V27"], title="Partial Dependence Plot")
In [28]:
# pdp.plot(variables=["V18", "V17"], geom="profiles", title="Partial Dependence Plot with individual profiles")
pdp.plot(variables=["V18", "V17"], title="Partial Dependence Plot")

5. Compare PDP between at least two different models.¶

In [19]:
explainer = explainers['LogisticRegression']
pdp_lr = explainer.model_profile() 
Calculating ceteris paribus: 100%|████████████████████████████████████████████████████| 17/17 [00:00<00:00, 106.23it/s]
In [20]:
pdp_lr.plot(pdp, variables=["V18", "V17"], title="Partial Dependence Plot")
In [22]:
pdp_lr.plot(pdp, variables=["V1", "V12", "V15", "V27"], title="Partial Dependence Plot")
In [25]:
pdp_lr.plot(pdp, title="Partial Dependence Plot")